
Conversation

@bssrdf
Contributor

@bssrdf bssrdf commented Oct 29, 2025

While working on #15805, I noticed the current cpy_flt kernel has significant uncoalesced global access.

This is particularly bad if one tries to make a transposed tensor contiguous with cur = ggml_cont(ctx, ggml_transpose(ctx, cur));. Some simple benchmarks from test-backend-ops perf -o CPY on a 4090:

Master, using permute to simulate a transpose:

 CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0]):                132 runs -  7642.58 us/run -  1572864 kB/run -  200.73 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0]):                172 runs -  6662.13 us/run -   786432 kB/run -  113.89 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0]):                344 runs -  2966.67 us/run -   786432 kB/run -  255.75 GB/s

Using shared memory is a common way to make global memory accesses coalesced. I implemented another copy kernel and got a 2x-4x boost.
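For reference, the approach follows the classic shared-memory tiled-transpose pattern; here is a minimal sketch of the idea (tile size, names and launch shape are illustrative, not the exact kernel in this PR):

```cuda
// Illustrative sketch of a shared-memory tiled transpose copy (not the exact
// kernel in this PR): both the global load and the global store are coalesced,
// and the actual transposition happens in shared memory.
#define TILE_DIM   32
#define BLOCK_ROWS 8

static __global__ void transpose_copy_f32(const float * __restrict__ src,
                                          float * __restrict__ dst,
                                          const int ne0, const int ne1) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];  // +1 padding avoids bank conflicts

    int x = blockIdx.x*TILE_DIM + threadIdx.x;      // column in src
    int y = blockIdx.y*TILE_DIM + threadIdx.y;      // row in src

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < ne0 && y + j < ne1) {
            tile[threadIdx.y + j][threadIdx.x] = src[(y + j)*ne0 + x];  // coalesced read
        }
    }
    __syncthreads();

    x = blockIdx.y*TILE_DIM + threadIdx.x;          // column in dst (= row in src)
    y = blockIdx.x*TILE_DIM + threadIdx.y;          // row in dst (= column in src)

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < ne1 && y + j < ne0) {
            dst[(y + j)*ne1 + x] = tile[threadIdx.x][threadIdx.y + j];  // coalesced write
        }
    }
}
// launched e.g. with dim3 grid((ne0+TILE_DIM-1)/TILE_DIM, (ne1+TILE_DIM-1)/TILE_DIM),
//                    dim3 block(TILE_DIM, BLOCK_ROWS)
```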
This PR, with the src tensor transposed:

 CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                484 runs -  2119.14 us/run -  1572864 kB/run -  723.92 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):                817 runs -  1276.55 us/run -   786432 kB/run -  594.35 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]):               1118 runs -   917.03 us/run -   786432 kB/run -  827.37 GB/s

Currently I use the src tensor's OP == TRANSPOSE to let ggml_cpy_flt_cuda pick the customized transposed copy. I am not sure if there is a better way to make this choice. Your suggestions are welcome.
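For illustration only (this is not the check the PR uses), a stride-based test would also catch a permute that is effectively a transpose:

```cuda
// Hypothetical alternative to checking src->op == GGML_OP_TRANSPOSE: inspect the
// strides instead. A view created by ggml_transpose has its innermost stride
// larger than the next one, which is the same condition ggml_is_transposed() tests.
static bool is_effectively_transposed(const ggml_tensor * src) {
    return src->nb[0] > src->nb[1];
}
```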

…void uncoalesced access; test cases also added showing improved memory bandwidth
@bssrdf bssrdf requested a review from slaren as a code owner October 29, 2025 12:54
@CISC
Collaborator

CISC commented Oct 29, 2025

Any reason for not including BF16 as well?

@bssrdf
Contributor Author

bssrdf commented Oct 29, 2025

Any reason for not including BF16 as well?

Just added support for BF16. Other quantized types may also have this problem, and the fix is not as straightforward as it is for FP32, etc.

@CISC
Collaborator

CISC commented Oct 29, 2025

Other quantized types may also have this problem, and the fix is not as straightforward as it is for FP32, etc.

Probably only relevant for quantized KV cache (perhaps not even then, unsure), so not a big issue.

@github-actions github-actions bot added the labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) Oct 29, 2025
@am17an
Collaborator

am17an commented Oct 29, 2025

@bssrdf this is failing the CI, do you mind taking a look?

@bssrdf
Contributor Author

bssrdf commented Oct 29, 2025

@bssrdf this is failing the CI, do you mind taking a look?

Yeah, something is wrong. Using the OP equality to make a decision is likely not robust. Will look into it. Thanks.

Edit: I ran ci/run.sh locally on my machine (WSL2 Ubuntu). The rerank test passed.

rerank score 0:    0.171
rerank score 1:    0.169
rerank score 2:    0.189

0.00.608.024 I llama_perf_context_print:        load time =     321.37 ms
0.00.608.025 I llama_perf_context_print: prompt eval time =     105.71 ms /    62 tokens (    1.70 ms per token,   586.52 tokens per second)
0.00.608.026 I llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
0.00.608.027 I llama_perf_context_print:       total time =     106.26 ms /    63 tokens
0.00.608.027 I llama_perf_context_print:    graphs reused =          0

real    0m0.864s
user    0m0.399s
sys     0m0.355s
  - rerank score 0 @ 0.171 OK
  - rerank score 1 @ 0.169 OK
  - rerank score 2 @ 0.189 OK

I am not sure what's going on. The 3 scores are the same, but it passed on my machine and failed in CI.

@am17an
Collaborator

am17an commented Oct 30, 2025

Try running it through compute-sanitizer
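For example, something along these lines (the binary path and arguments here are just an illustration):

```
compute-sanitizer --tool memcheck ./bin/test-backend-ops test -o CPY
```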

Collaborator

@JohannesGaessler JohannesGaessler left a comment

In terms of memory bandwidth for FP16 and BF16, 32*2=64 bytes is still suboptimal vs. the 64*2=128 bytes that you could be transferring. The shared memory banks similarly have the issue where they are best suited for transfers of 4, 8, or 16 bytes. But then the handling of the 2 byte datatypes becomes more tricky. In mma.cuh there is a function ggml_cuda_movmatrix that you can use to transpose a matrix. To be clear, what I'm suggesting is optional and we can also move towards merging the kernel as-is.

@bssrdf
Contributor Author

bssrdf commented Oct 30, 2025

In terms of memory bandwidth for FP16 and BF16, 32*2=64 bytes is still suboptimal vs. the 64*2=128 bytes that you could be transferring. The shared memory banks similarly have the issue where they are best suited for transfers of 4, 8, or 16 bytes. But then the handling of the 2 byte datatypes becomes more tricky. In mma.cuh there is a function ggml_cuda_movmatrix that you can use to transpose a matrix. To be clear, what I'm suggesting is optional and we can also move towards merging the kernel as-is.

@JohannesGaessler, thanks for the comments. I realized the logic for triggering the transposed copy is not right. I am working on a more general way of copying. The bottom line is to avoid uncoalesced access as much as possible, since it really hurts bandwidth.

@JohannesGaessler
Collaborator

Is this PR in a state where you want to move towards merging it?

@bssrdf
Contributor Author

bssrdf commented Nov 2, 2025

Is this PR in a state where you want to move towards merging it?

Made another push restoring some old test cases. Now ready to move forward. Thanks.

Collaborator

@JohannesGaessler JohannesGaessler left a comment

Is the cpy_flt_transpose kernel still needed for a use case that cannot be covered by cpy_flt_coalesced?

@bssrdf
Contributor Author

bssrdf commented Nov 3, 2025

Is the cpy_flt_transpose kernel still needed for a use case that cannot be covered by cpy_flt_coalesced?

@JohannesGaessler, thank you for the review. I'll work through the comments one by one.
The original intent was only to improve the transposed copy, but there are some cases where a permute is composed with a transpose and cpy_flt_transpose failed. Maybe I just need a stricter condition check. cpy_flt_transpose is faster than cpy_flt_coalesced, while cpy_flt_coalesced can handle more general permuted cases.

@bssrdf
Contributor Author

bssrdf commented Nov 3, 2025

I have decided to remove cpy_flt_coalesced and focus solely on the transpose case. The logic in cpy_flt_coalesced is too complicated and not very clean.

A permute is also handled if it is in fact a transpose. The code has been updated with your suggestions.

Current master

  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               132 runs -  7841.80 us/run -  1572864 kB/run -  195.63 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               172 runs -  6818.98 us/run -   786432 kB/run -  111.27 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               344 runs -  3060.79 us/run -   786432 kB/run -  247.88 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):                     344 runs -  3059.14 us/run -   786432 kB/run -  248.02 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               132 runs -  7836.65 us/run -  1572864 kB/run -  195.76 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               330 runs -  3153.94 us/run -  1572864 kB/run -  486.40 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               172 runs -  6802.20 us/run -   786432 kB/run -  111.54 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               344 runs -  3057.42 us/run -   786432 kB/run -  248.16 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):                     344 runs -  3056.52 us/run -   786432 kB/run -  248.23 GB/s

This PR

  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               462 runs -  2202.66 us/run -  1572864 kB/run -  696.47 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               731 runs -  1400.65 us/run -   786432 kB/run -  541.69 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):              1075 runs -   948.73 us/run -   786432 kB/run -  799.72 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):                    1032 runs -   972.86 us/run -   786432 kB/run -  779.89 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               462 runs -  2190.42 us/run -  1572864 kB/run -  700.37 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               506 runs -  2034.23 us/run -  1572864 kB/run -  754.14 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               688 runs -  1458.58 us/run -   786432 kB/run -  520.18 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):              1075 runs -   950.66 us/run -   786432 kB/run -  798.10 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):                    1032 runs -   972.45 us/run -   786432 kB/run -  780.21 GB/s

@@ -1,3 +1,4 @@
#include <algorithm>
Collaborator

Suggested change
#include <algorithm>

I think this header is no longer needed.

Contributor Author

Yes, removed.

Contributor Author

I also made a few adjustments to reduce bank conflicts for the fp16 and bf16 types. This seems to give a small bump in bandwidth for those types.

  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               462 runs -  2247.70 us/run -  1572864 kB/run -  682.52 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               731 runs -  1392.03 us/run -   786432 kB/run -  545.05 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):              1118 runs -   922.83 us/run -   786432 kB/run -  822.17 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):                    1118 runs -   923.18 us/run -   786432 kB/run -  821.86 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               484 runs -  2143.72 us/run -  1572864 kB/run -  715.62 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               528 runs -  1952.04 us/run -  1572864 kB/run -  785.89 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               731 runs -  1381.74 us/run -   786432 kB/run -  549.11 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):              1118 runs -   926.10 us/run -   786432 kB/run -  819.27 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):                    1075 runs -   934.95 us/run -   786432 kB/run -  811.51 GB/s

Contributor Author

I ended up with padding instead of swizzling. Padding is fine here since the shared memory requirement is not tight.
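For context, a quick sketch of the two options (the tile shape and names are illustrative, not the PR's actual code):

```cuda
// Sketch of the two standard fixes for shared-memory bank conflicts in a 32x32
// float transpose tile (illustrative only). Bank = (word address) mod 32, so an
// unpadded [32][32] tile puts an entire column into a single bank.
static __global__ void transpose_tile_padded(const float * __restrict__ src,
                                             float * __restrict__ dst) {
    // (a) padding: one spare column per row; costs 32 extra floats of smem per tile
    __shared__ float tile[32][32 + 1];

    const int r = threadIdx.y;
    const int c = threadIdx.x;

    tile[r][c] = src[r*32 + c];   // coalesced load
    __syncthreads();
    dst[r*32 + c] = tile[c][r];   // column read; conflict-free thanks to the +1

    // (b) swizzling keeps the tile at [32][32] but XORs the column index instead:
    //     write tile[r][c ^ r], read tile[c][r ^ c]; same effect with no extra
    //     shared memory, but every access has to carry the XOR.
}
// launched with dim3 block(32, 32) on a single 32x32 tile
```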
